NSF PAR Search | NSF Public Access Repository

3GOLD: optimized Levenshtein distance for clustering third-generation sequencing data

https://doi.org/10.1186/s12859-022-04637-7

Logan, Robert; Fleischmann, Zoe; Annis, Sofia; Wehe, Amy Wangsness; Tilly, Jonathan L.; Woods, Dori C.; Khrapko, Konstantin (December 2022, BMC Bioinformatics)

Abstract

Background: Third-generation sequencing offers some advantages over its next-generation sequencing predecessors, but with the caveat of a much higher error rate. Clustering related sequences is an essential task in modern biology. To accurately cluster sequences rich in errors, error type and frequency need to be accounted for. Levenshtein distance is a well-established mathematical algorithm for measuring the edit distance between words and can specifically weight insertions, deletions and substitutions. However, there are drawbacks to using Levenshtein distance in a biological context, and hence it has rarely been used for this purpose. We present novel modifications to the Levenshtein distance algorithm to optimize it for clustering error-rich biological sequencing data.

Results: We successfully introduced a bidirectional frameshift allowance with end-user-determined accommodation caps, combined with weighted error discrimination. Furthermore, our modifications dramatically improved the computational speed of Levenshtein distance. For simulated ONT MinION and PacBio Sequel datasets, the average clustering sensitivity for 3GOLD was 41.45% (S.D. 10.39) higher than Sequence-Levenshtein distance, 52.14% (S.D. 9.43) higher than Levenshtein distance, 55.93% (S.D. 8.67) higher than Starcode, 42.68% (S.D. 8.09) higher than CD-HIT-EST and 61.49% (S.D. 7.81) higher than DNACLUST. For biological ONT MinION data, 3GOLD clustering sensitivity was 27.99% higher than Sequence-Levenshtein distance, 52.76% higher than Levenshtein distance, 56.39% higher than Starcode, 48% higher than CD-HIT-EST and 70.4% higher than DNACLUST.

Conclusion: Our modifications to Levenshtein distance have improved its speed and accuracy compared to the classic Levenshtein distance, Sequence-Levenshtein distance and other commonly used clustering approaches on simulated and biological third-generation sequenced datasets. Our clustering approach is appropriate for datasets of unknown cluster centroids, such as those generated with unique molecular identifiers, as well as for datasets of known centroids, such as barcoded datasets. A strength of our approach is high accuracy in resolving small clusters and mitigating the number of singletons.
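The abstract notes that Levenshtein distance can weight insertions, deletions and substitutions separately. As a rough illustration of that baseline idea only, the following is a minimal sketch of the classic dynamic-programming algorithm with per-operation costs. It is not the 3GOLD method: it has no frameshift allowance, accommodation caps or speed optimizations, and the function name `weighted_levenshtein` and its cost parameters are hypothetical names chosen for this example.

```python
def weighted_levenshtein(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Cost of editing sequence a into sequence b with per-operation weights.

    Classic textbook algorithm; names and default costs are assumptions
    made for illustration, not taken from the 3GOLD paper.
    """
    # prev[j] is the cost of turning the already-processed prefix of a into b[:j].
    # Before any character of a is consumed, reaching b[:j] needs j insertions.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Turning a[:i] into the empty string needs i deletions.
        curr = [i * del_cost]
        for j in range(1, len(b) + 1):
            # Diagonal move: free if the characters match, else a substitution.
            diag = prev[j - 1] + (0.0 if a[i - 1] == b[j - 1] else sub_cost)
            # Vertical move deletes a[i-1]; horizontal move inserts b[j-1].
            curr.append(min(diag, prev[j] + del_cost, curr[j - 1] + ins_cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Unit costs reproduce the ordinary Levenshtein distance.
    print(weighted_levenshtein("kitten", "sitting"))
    # Making deletions cheaper lowers the cost of a read with a dropped base.
    print(weighted_levenshtein("ACGT", "ACT", del_cost=0.5))
```

Lowering the cost of whichever error type dominates a sequencing platform makes reads that differ only by that error type score as closer, which is the general motivation for weighting edits when clustering error-rich reads.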
Full Text Available
Evaluating the generalisability of neural rumour verification models

https://doi.org/10.1016/j.ipm.2022.103116

Kochkina, Elena; Hossain, Tamanna; Logan, Robert L.; Arana-Catania, Miguel; Procter, Rob; Zubiaga, Arkaitz; Singh, Sameer; He, Yulan; Liakata, Maria (January 2023, Information Processing & Management)

Full Text Available
LUCS: a high-resolution nucleic acid sequencing tool for accurate long-read analysis of individual DNA molecules

https://doi.org/10.18632/aging.103171

Annis, Sofia; Fleischmann, Zoë; Logan, Robert; Mullin-Bernstein, Zachary; Franco, Melissa; Saürich, Josefin; Tilly, Jonathan L.; Woods, Dori C.; Khrapko, Konstantin (April 2020, Aging)
Full Text Available
Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling

https://doi.org/10.18653/v1/P19-1598

Logan, Robert; Liu, Nelson F.; Peters, Matthew E.; Gardner, Matt; Singh, Sameer (July 2019, Annual Meeting of the Association for Computational Linguistics (ACL))

Modeling human language requires the ability to not only generate fluent text but also encode factual knowledge. However, traditional language models are only capable of remembering facts seen at training time, and often have difficulty recalling them. To address this, we introduce the knowledge graph language model (KGLM), a neural language model with mechanisms for selecting and copying facts from a knowledge graph that are relevant to the context. These mechanisms enable the model to render information it has never seen before, as well as generate out-of-vocabulary tokens. We also introduce the Linked WikiText-2 dataset, a corpus of annotated text aligned to the Wikidata knowledge graph whose contents (roughly) match the popular WikiText-2 benchmark. In experiments, we demonstrate that the KGLM achieves significantly better performance than a strong baseline language model. We additionally compare different language models' ability to complete sentences requiring factual knowledge, showing that the KGLM outperforms even very large language models in generating facts.
Full Text Available
Knowledge Enhanced Contextual Word Representations

https://doi.org/10.18653/v1/D19-1005

Peters, Matthew E.; Neumann, Mark; Logan, Robert; Schwartz, Roy; Joshi, Vidur; Singh, Sameer; Smith, Noah A. (January 2019, Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP))

Contextual word representations, typically trained on unstructured, unlabeled text, do not contain any explicit grounding to real-world entities and are often unable to remember facts about those entities. We propose a general method to embed multiple knowledge bases (KBs) into large-scale models, and thereby enhance their representations with structured, human-curated knowledge. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of word-to-entity attention. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text. After integrating WordNet and a subset of Wikipedia into BERT, the knowledge-enhanced BERT (KnowBert) demonstrates improved perplexity, improved ability to recall facts as measured in a probing task, and improved downstream performance on relationship extraction, entity typing, and word sense disambiguation. KnowBert’s runtime is comparable to BERT’s and it scales to large KBs.
Full Text Available
